This project focuses on using data as a tool to assess, analyze, and eventually predict housing prices in Mecklenburg County, North Carolina. This sort of robust descriptive data analytics can help inform policy and help decision makers/stakeholders understand the future demand for land, housing, education, and transportation needs of a neighborhood, city, and the greater region at large. This project was formulated using similar regression, and feature modeling from Zillow housing estimates, to train a model from recent sale prices, and the training set, and assess whether our model generalizes to properties in the same areas that have not sold yet.
This project required several machine learning processes for us to better understand the housing data, the variables within the housing data, and any external data sources we decided to add. Data wrangling, exploratory analysis, feature engineering, feature selection, and model estimation and validation were all machine learning processes used as major steps in this predictive modeling exercise.
In order to better understand/explain sale prices in particular areas, Zillow uses feature modeling which requires manipulating both internal housing characteristics and external characteristics. This hedonic model was used in our research approach, to predict sale prices in Mecklenburg as a theoretical framework. It functions as a way to break down the value of a house price into the value of the characteristics that constitute it. Internal housing characteristics such as the number of bathrooms, the age of the house, and the number of bedrooms, for example, were some of the variables used to evaluate which might impact the sale price of a house in the county. Public services/amenities like tree canopy, or spatial processes of clustering of homes around schools or parks, were some external variables used to attempt to create an accurate and generalizable model.
This project was difficult and required us to utilize many of the technical and analytical skills learned thus far in the semester. However, the most difficult was understanding the value of what homebuyers want. Buying a home is such a nuanced process and there are numerous variables, amenities, and disamenities that inform a potential home buyer’s decision. We learned the power of machine learning but also realized there are limitations to predicting how human needs and desires.
The dataset imported contains the attributes of homes in Mecklenburg, NC such as prices, bedrooms, bathrooms, etc. We will utilize these attributes to inform our model to predict home prices eventually.
HousingData <-
st_read("studentData.geojson")%>%
st_transform('EPSG:2264')
Mecklenburg <-
st_read("MecklenburgCounty_Boundary.shp")%>%
st_transform('EPSG:2264')
Mecklenburg_Tracts <-
st_read("npa.shp")%>%
st_transform('EPSG:2264')
Starting with the housing data set as it is, we decided to group, manipulate, and create new variables in the dataset solely based on the attributes of the homes. Our new “internal variables” are included in the model to predict home prices.
# Total baths
HousingData <-
HousingData %>%
mutate(totalbaths = (halfbaths *0.5)+(fullbaths))
# Age
HousingData <-
HousingData %>%
mutate(Age = 2022 - yearbuilt)
# Old vs. New Houses, Old = 0 and new = 1
HousingData$houseoldnew = ifelse(HousingData$"yearbuilt"<2000, 0 , 1)
# Bedroom Groupings, small houses = 1, medium = 2, large = 3
HousingData <- mutate(HousingData, bedroomgroups = case_when(bedrooms <= 2 ~ 1,
bedrooms >= 3 & bedrooms <= 4 ~ 2,
bedrooms >= 5 ~ 3))
# Story groupings, One level = 1, multilevel = 2
HousingData$storygroups = ifelse(HousingData$"storyheigh"== "1 STORY", 1, 2)
We included some demographics, and socio-economic and environmental data about Mecklenburg, that are spatially tied to census tracts rather than points. The data was extracted from the 2019 5-year ACS and Mecklenburg’s open data source.
We included amenities and services, we thought would impact the variability of housing prices across the county and neighborhoods. We calculated distances from these amenities as well as buffers to each home observation to determine their spatial impact on sale prices. Schools and parks data were sourced from Mecklenburg’s open data source.
# Public schools
PublicSchools <-
st_read("CMS_Schools.shp")%>%
st_transform('EPSG:2264')%>%
dplyr::select(-school_num,-address,-city,-zipcode,-school,-magnet,-grdlevl,-mag_focus,-schl_stat)
PublicSchools1 <-
st_read("CMS_Schools.shp")%>%
st_transform('EPSG:2264')%>%
dplyr::select(-school_num,-address,-city,-zipcode,-school,-magnet,-grdlevl,
-mag_focus,-schl_stat,-school_typ,)
# Parks
ParkAmenities <-
st_read("ParkAmenity.shp")%>%
st_transform('EPSG:2264')%>%
dplyr::select(geometry)
In order to use the external variables, we need to add them to the housing dataset by expressing how they relate to each home observation. We will utilize buffers and average nearest neighbor distance to measure occurrences of the amenities and the from the homes to the distances to them.
The buffer sums schools within 3 miles of each home sale observation.
HousingData$schools.Buffer =
st_buffer(HousingData, 15840) %>%
aggregate(mutate(PublicSchools1, counter = 1),., sum) %>%
pull(counter)
This calculates the average nearest neighbor distance from each home sale to its k nearest school.
HousingData <-
HousingData %>%
mutate(
school_nn1 = nn_function(st_coordinates(HousingData),
st_coordinates(PublicSchools1), k = 1),
school_nn2 = nn_function(st_coordinates(HousingData),
st_coordinates(PublicSchools1), k = 2),
school_nn3 = nn_function(st_coordinates(HousingData),
st_coordinates(PublicSchools1), k = 3),
school_nn4 = nn_function(st_coordinates(HousingData),
st_coordinates(PublicSchools1), k = 4),
school_nn5 = nn_function(st_coordinates(HousingData),
st_coordinates(PublicSchools1), k = 5))
ggplot() + geom_sf(data = Mecklenburg_Tracts, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(PublicSchools1)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#ecda9a", high= "#ee4d5a", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Schools, Mecklenburg County") +
mapTheme()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
The buffer sums parks within 1/2 mile of each home sale observation.
HousingData$Park.Buffer =
st_buffer(HousingData, 2640) %>%
aggregate(mutate(ParkAmenities, counter = 1),., sum) %>%
pull(counter)
HousingData$Park.Buffer[is.na(HousingData$Park.Buffer)] <- 0
This map illustrates the distribution of home prices across the county. From this map, you can notice some patterns of spatial clustering of higher and lower home prices which we hope to capture in our model.
ggplot() +
geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
geom_sf(data = Mecklenburg_Tracts, color = "white") +
geom_sf(data = HousingData, aes(colour = q5(price)),
show.legend = "point", size = .25) +
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
labs(title= "Home Prices", subtitle = "Mecklenburg County, NC") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
This is a map of one of the internal housing characteristics that were included in our model.
ggplot() +
geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
geom_sf(data = Mecklenburg_Tracts, color = "white") +
geom_sf(data = HousingData, aes(colour = bldggrade),
show.legend = "point", size = .25) +
scale_colour_manual(values = rev(palette13)) +
labs(title= "Home Quality Description", subtitle = "Mecklenburg County, NC") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
The maps below show the distribution of amenities/services overlapped with home prices in the county.
ggplot() +
geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
geom_sf(data = Mecklenburg_Tracts, color = "white") +
geom_sf(data = HousingData, aes(colour = q5(price)),
show.legend = "point", size = .25, alpha = .15 )+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
geom_sf(data = PublicSchools, size = 1.25, color = "#589adb") +
labs(title= "Public Schools", subtitle = "Mecklenburg County, NC") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
ggplot() +
geom_sf(data = Mecklenburg_Tracts, fill = "grey") +
geom_sf(data = Mecklenburg_Tracts, color = "white") +
geom_sf(data = HousingData, aes(colour = q5(price)),
show.legend = "point", size = .25, alpha = .15 )+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
geom_sf(data = ParkAmenities, size = 1.0, color = "#54bebe") +
labs(title= "Park Amenities", subtitle = "Mecklenburg County") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
The maps below illustrate the relationship between spatial variables and home prices across the county.
ggplot() +
geom_sf(data = pctBachelor, aes(fill = (Bachelor_2020)))+
labs(title= "% of Bachelor Degree Holders", subtitle = "Mecklenburg County, NC") +
geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
show.legend = "point", size = .10)+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
ggplot() +
geom_sf(data = Income, aes(fill = (Income_2020)))+
labs(title= "Median Income Distribution", subtitle = "Mecklenburg County, NC") +
geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
show.legend = "point", size = .10)+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
These final maps include spatial variables whose relationship with home prices appeared to be interesting. For example, looking at the maps illustrating the percentage of Black/White residents overlapped with the home prices, you can immediately see how home prices differ by race in census tracts with a majority of Black/White residents.
ggplot() +
geom_sf(data = pctBlack, aes(fill = (Black_2020)))+
labs(title= "% of Black Residents", subtitle = "Mecklenburg County, NC") +
geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
show.legend = "point", size = .10)+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
ggplot() +
geom_sf(data = pctWhite, aes(fill = (White_2020)))+
labs(title= "% of White Residents", subtitle = "Mecklenburg County, NC") +
geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
show.legend = "point", size = .10)+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
ggplot() +
geom_sf(data = pctTreeCanopy, aes(fill = (Treecanopy_2012)))+
labs(title= "% Tree Cover", subtitle = "Mecklenburg County, NC") +
geom_sf(data = HousingData, alpha = .25, aes(colour = q5(price)),
show.legend = "point", size = .10)+
scale_colour_manual(values = (palette11),
labels=qBr(HousingData,"price"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 5)))+
mapTheme()
The tables below show summary statistics for our internal and external values.
# Internal variables
ss_Internal <- HousingData
ss_Internal <- st_drop_geometry(ss_Internal)
ss_Internal <- ss_Internal %>%
dplyr::select("totalbaths", "price", "fullbaths", "halfbaths", "Age", "bedrooms", "numfirepla", "bldggrade", "heatedfuel", "bedroomgroups", "yearbuilt", "actype", "extwall", "foundation", "storyheigh", "heatedarea", "houseoldnew")
stargazer(as.data.frame(ss_Internal), type="text", digits=1, title = "Descriptive Statistics for Internal Variables of Homes in Meckenburg, NC", out = "Training_Internal.txt")
##
## Descriptive Statistics for Internal Variables of Homes in Meckenburg, NC
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## totalbaths 46,183 2.6 0.9 0.0 11.0
## price 46,183 465,686.8 602,409.8 0 75,500,000
## fullbaths 46,183 2.3 0.8 0 9
## halfbaths 46,183 0.6 0.5 0 4
## Age 46,183 28.6 31.8 0 2,022
## bedrooms 46,183 3.5 0.9 0 65
## numfirepla 46,183 0.8 0.5 0 7
## bedroomgroups 46,183 2.1 0.4 1 3
## yearbuilt 46,183 1,993.4 31.8 0 2,022
## heatedarea 46,183 2,359.4 1,060.3 0.0 14,710.0
## houseoldnew 46,183 0.5 0.5 0 1
## -------------------------------------------------------
# Amenities/Services
ss_Amenities<- HousingData
ss_Amenities <- st_drop_geometry(ss_Amenities)
ss_Amenities <- ss_Amenities %>%
dplyr::select("school_nn1", "Park.Buffer")
stargazer(as.data.frame(ss_Amenities), type="text", digits=1, title="Descriptive Statistics for Amenities and Services near Homes in Meckenburg, NC", out = "Training_External.txt")
##
## Descriptive Statistics for Amenities and Services near Homes in Meckenburg, NC
## ==================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------
## school_nn1 46,183 4,759.2 3,070.7 117.0 22,481.3
## Park.Buffer 46,183 5.5 10.8 0 132
## --------------------------------------------------
# Spatial Structure
ss_Spatial<- HousingData
ss_Spatial <- st_drop_geometry(ss_Spatial)
ss_Spatial <- ss_Spatial %>%
dplyr::select("Black_2020", "White_2020", "Income_2020", "Treecanopy_2012", "Bachelor_2020")
stargazer(as.data.frame(ss_Spatial), type="text", digits=1, title="Descriptive Statistics for Spatial Structures near Homes in Meckenburg, NC", out = "Training_External2.txt")
##
## Descriptive Statistics for Spatial Structures near Homes in Meckenburg, NC
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## Black_2020 46,181 27.0 21.2 0.4 87.2
## White_2020 46,181 49.5 26.2 1.5 93.3
## Income_2020 45,201 89,895.1 42,113.8 10,797 250,001
## Treecanopy_2012 46,165 54.0 12.4 1.6 77.0
## Bachelor_2020 46,181 47.2 20.8 4.2 91.6
## -------------------------------------------------------
In order to train and test our model, we remove the rows where the home prices are zeros to prepare our model for prediction. The challenge set will be made of 100 variables with no home prices which we will predict after our model is trained and tested.
# Challenge set
Challenge_Meclenburg<- filter(HousingData, price == 0)
# Remove the zeros
HousingData_nozero <-HousingData[!(HousingData$price== 0),]
The following graphs will begin to analyze the relationship between our variables and home sale prices. These graphs will be helpful to determine significant variables in our data set that should be added to our model.
# Home Features cor
st_drop_geometry(HousingData_nozero) %>%
dplyr::select(price, Age, bedrooms, numfirepla) %>%
filter(price <= 8000000, Age<200, bedrooms<10) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Price as a function of continuous variables") +
plotTheme()
#bldggrade, heatedfuel,bedroomgroups, storygroups
st_drop_geometry(HousingData_nozero) %>%
dplyr::select(price, bedroomgroups, storygroups, totalbaths) %>%
filter(price <= 8000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Price as a function of continuous variables") +
plotTheme()
# School Corr
HousingData_nozero %>%
st_drop_geometry() %>%
dplyr::select(price, starts_with("school_")) %>%
filter(price <= 8000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "pink") +
facet_wrap(~Variable, nrow = 1, scales = "free") +
labs(title = "Price as a function of continuous variables") +
plotTheme()
numericVars <-
select_if(st_drop_geometry(HousingData_nozero), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
tl.cex=5,
p.mat = cor_pmat(numericVars),
colors = c("#f3e79b","#eb7f86","#5c53a5","#a059a0", "#fac484"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables")
#palette11<- c("#f3e79b","#fac484","#f8a07e","#eb7f86","#ce6693","#a059a0","#5c53a5")
Our Initial regression model includes the adjusted internal and external variables that have been added, as well as the attributes of the home that were present in the housing data. The adjusted R-squared values show that this model accounts for 24% of the variation in home sale prices.
##
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(HousingData_nozero) %>%
## dplyr::select(price, fullbaths, halfbaths, Age, bedrooms,
## numfirepla, bldggrade, heatedfuel, bedroomgroups, yearbuilt,
## actype, extwall, foundation, storyheigh, heatedarea,
## houseoldnew, Income_2020, Treecanopy_2012, White_2020,
## Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2379200 -83657 -15067 56436 74496603
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.742e+05 3.032e+05 -1.234 0.217241
## fullbaths 3.011e+04 5.841e+03 5.155 2.54e-07 ***
## halfbaths 2.735e+04 6.384e+03 4.285 1.83e-05 ***
## Age 1.606e+03 2.269e+02 7.076 1.50e-12 ***
## bedrooms 3.988e+04 4.227e+03 9.435 < 2e-16 ***
## numfirepla -5.384e+03 6.188e+03 -0.870 0.384254
## bldggradeCUSTOM 1.493e+06 3.957e+04 37.741 < 2e-16 ***
## bldggradeEXCELLENT 7.139e+05 2.213e+04 32.262 < 2e-16 ***
## bldggradeFAIR -2.748e+04 2.123e+04 -1.294 0.195554
## bldggradeGOOD 5.553e+04 7.579e+03 7.327 2.39e-13 ***
## bldggradeMINIMUM -5.280e+04 1.147e+05 -0.460 0.645356
## bldggradeVERY GOOD 2.581e+05 1.238e+04 20.844 < 2e-16 ***
## heatedfuelGAS -1.841e+04 7.977e+03 -2.308 0.021011 *
## heatedfuelNONE -2.393e+04 6.721e+04 -0.356 0.721854
## heatedfuelOIL/WD/COAL 6.569e+04 3.584e+04 1.833 0.066788 .
## heatedfuelSOLAR/GEOTHRM -1.237e+04 1.889e+05 -0.066 0.947760
## bedroomgroups -5.091e+04 9.582e+03 -5.313 1.08e-07 ***
## yearbuilt NA NA NA NA
## actypeAC-CHLD WAT -1.025e+05 3.013e+05 -0.340 0.733608
## actypeAC-NONE -4.309e+04 1.536e+04 -2.805 0.005037 **
## actypeAC-PCKD ROOF 1.054e+04 5.216e+05 0.020 0.983881
## actypeAC-WALL UNIT -4.803e+04 9.150e+04 -0.525 0.599641
## actypeAIR-DUCTED -3.024e+03 7.965e+04 -0.038 0.969717
## actypeAIR-NO-DUCT -3.738e+04 1.845e+05 -0.203 0.839443
## actypeHEAT - NONE 1.117e+05 3.750e+05 0.298 0.765878
## actypeHEAT PUMP 5.458e+04 6.810e+04 0.801 0.422857
## extwallASB SHNG/SDG -8.251e+04 4.292e+04 -1.922 0.054566 .
## extwallBOARD&BATTEN 9.352e+04 8.725e+04 1.072 0.283813
## extwallCEDAR,RDWD 8.334e+04 2.675e+04 3.116 0.001837 **
## extwallCEM BR/SPL B -2.630e+04 9.112e+04 -0.289 0.772906
## extwallCOMP OR WLBD -3.447e+04 2.337e+05 -0.148 0.882737
## extwallCONC BLOCK -3.424e+03 9.175e+04 -0.037 0.970230
## extwallCORR MTL LGT -1.837e+05 3.890e+05 -0.472 0.636768
## extwallEXT PLYWOOD 1.109e+05 4.366e+04 2.539 0.011123 *
## extwallFACE BRICK -1.567e+04 8.586e+03 -1.825 0.067976 .
## extwallHARDIPLK/DSGN VINYL 5.783e+03 8.311e+03 0.696 0.486574
## extwallIMITATION STONE 1.413e+05 3.690e+05 0.383 0.701706
## extwallLOG 4.057e+04 1.449e+05 0.280 0.779418
## extwallMASONITE 2.747e+03 1.010e+04 0.272 0.785545
## extwallMODULAR MTL -8.691e+04 3.095e+05 -0.281 0.778845
## extwallPREFIN MTL -5.713e+04 3.015e+05 -0.190 0.849691
## extwallSDG MIN/NONE -1.995e+03 3.689e+05 -0.005 0.995685
## extwallSIDG NO SHTG -8.505e+04 1.275e+05 -0.667 0.504625
## extwallSTONE 2.452e+05 1.091e+05 2.246 0.024687 *
## extwallSTUCCO HRDCT 4.713e+04 2.354e+04 2.003 0.045217 *
## extwallSTUCCO SYNTH 2.844e+04 5.328e+04 0.534 0.593525
## extwallWOOD ON SHTG -5.140e+03 1.381e+04 -0.372 0.709689
## extwallWOOD SHINGLE 9.147e+04 4.246e+04 2.154 0.031232 *
## foundationCRAWL SPACE 2.239e+04 1.913e+04 1.170 0.241883
## foundationPIER-NO FOUND WALL 3.057e+05 2.468e+05 1.239 0.215461
## foundationSLAB-ABV GRD 7.813e+04 7.859e+04 0.994 0.320152
## foundationSLAB-COM 4.041e+04 2.341e+05 0.173 0.862937
## foundationSLAB-RES 3.051e+04 1.977e+04 1.543 0.122753
## storyheigh1 STORY 1.997e+05 3.016e+05 0.662 0.507747
## storyheigh1.5 STORY 1.962e+05 3.016e+05 0.650 0.515450
## storyheigh2.0 STORY 1.386e+05 3.015e+05 0.460 0.645737
## storyheigh2.5 STORY 1.116e+05 3.018e+05 0.370 0.711698
## storyheigh3.0 STORY 4.713e+03 3.081e+05 0.015 0.987795
## storyheighBI-LEVEL 1.094e+05 3.042e+05 0.360 0.719126
## storyheighCAPE COD 1.602e+05 3.233e+05 0.495 0.620264
## storyheighRANCH W/BSMT 1.501e+05 3.026e+05 0.496 0.619759
## storyheighSPLIT LEVEL 1.338e+05 3.021e+05 0.443 0.657868
## heatedarea 1.101e+02 5.322e+00 20.688 < 2e-16 ***
## houseoldnew 3.721e+04 9.101e+03 4.089 4.34e-05 ***
## Income_2020 3.773e-01 1.084e-01 3.479 0.000503 ***
## Treecanopy_2012 -1.419e+03 2.401e+02 -5.907 3.50e-09 ***
## White_2020 2.129e+03 3.054e+02 6.972 3.17e-12 ***
## Black_2020 3.880e+02 3.082e+02 1.259 0.208100
## Bachelor_2020 2.810e+02 2.440e+02 1.152 0.249463
## Park.Buffer 2.143e+02 2.500e+02 0.857 0.391330
## schools.Buffer 8.114e+03 4.969e+02 16.331 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 521500 on 44501 degrees of freedom
## (1512 observations deleted due to missingness)
## Multiple R-squared: 0.2432, Adjusted R-squared: 0.2421
## F-statistic: 207.3 on 69 and 44501 DF, p-value: < 2.2e-16
##
## First Linear Model of Dataset
## ========================================================
## Dependent variable:
## ---------------------------
## price
## --------------------------------------------------------
## fullbaths 30,114.0***
## (5,841.3)
##
## halfbaths 27,350.8***
## (6,383.6)
##
## Age 1,605.7***
## (226.9)
##
## bedrooms 39,882.4***
## (4,227.3)
##
## numfirepla -5,383.9
## (6,187.8)
##
## bldggradeCUSTOM 1,493,290.0***
## (39,566.8)
##
## bldggradeEXCELLENT 713,937.0***
## (22,129.5)
##
## bldggradeFAIR -27,481.3
## (21,231.9)
##
## bldggradeGOOD 55,532.8***
## (7,579.0)
##
## bldggradeMINIMUM -52,796.2
## (114,717.7)
##
## bldggradeVERY GOOD 258,062.6***
## (12,380.9)
##
## heatedfuelGAS -18,409.7**
## (7,976.9)
##
## heatedfuelNONE -23,927.3
## (67,214.6)
##
## heatedfuelOIL/WD/COAL 65,694.1*
## (35,836.9)
##
## heatedfuelSOLAR/GEOTHRM -12,374.6
## (188,865.5)
##
## bedroomgroups -50,910.0***
## (9,582.1)
##
## yearbuilt
##
##
## actypeAC-CHLD WAT -102,532.3
## (301,271.0)
##
## actypeAC-NONE -43,087.9***
## (15,362.4)
##
## actypeAC-PCKD ROOF 10,537.4
## (521,566.2)
##
## actypeAC-WALL UNIT -48,029.6
## (91,498.8)
##
## actypeAIR-DUCTED -3,024.0
## (79,653.2)
##
## actypeAIR-NO-DUCT -37,375.6
## (184,472.8)
##
## actypeHEAT - NONE 111,654.2
## (374,964.3)
##
## actypeHEAT PUMP 54,579.7
## (68,098.5)
##
## extwallASB SHNG/SDG -82,509.8*
## (42,921.0)
##
## extwallBOARD&BATTEN 93,517.0
## (87,251.8)
##
## extwallCEDAR,RDWD 83,344.6***
## (26,750.4)
##
## extwallCEM BR/SPL B -26,295.3
## (91,120.6)
##
## extwallCOMP OR WLBD -34,473.8
## (233,718.5)
##
## extwallCONC BLOCK -3,424.0
## (91,745.9)
##
## extwallCORR MTL LGT -183,711.3
## (389,033.1)
##
## extwallEXT PLYWOOD 110,853.6**
## (43,662.0)
##
## extwallFACE BRICK -15,671.7*
## (8,586.2)
##
## extwallHARDIPLK/DSGN VINYL 5,782.8
## (8,311.4)
##
## extwallIMITATION STONE 141,335.7
## (369,002.9)
##
## extwallLOG 40,572.1
## (144,859.1)
##
## extwallMASONITE 2,746.9
## (10,095.3)
##
## extwallMODULAR MTL -86,905.7
## (309,464.6)
##
## extwallPREFIN MTL -57,130.5
## (301,458.1)
##
## extwallSDG MIN/NONE -1,995.1
## (368,919.0)
##
## extwallSIDG NO SHTG -85,050.7
## (127,467.3)
##
## extwallSTONE 245,184.7**
## (109,148.4)
##
## extwallSTUCCO HRDCT 47,134.3**
## (23,535.5)
##
## extwallSTUCCO SYNTH 28,436.2
## (53,277.3)
##
## extwallWOOD ON SHTG -5,140.0
## (13,806.8)
##
## extwallWOOD SHINGLE 91,471.0**
## (42,462.3)
##
## foundationCRAWL SPACE 22,385.4
## (19,127.8)
##
## foundationPIER-NO FOUND WALL 305,659.0
## (246,757.0)
##
## foundationSLAB-ABV GRD 78,131.3
## (78,590.5)
##
## foundationSLAB-COM 40,406.2
## (234,051.0)
##
## foundationSLAB-RES 30,511.9
## (19,769.9)
##
## storyheigh1 STORY 199,742.5
## (301,564.3)
##
## storyheigh1.5 STORY 196,172.1
## (301,626.8)
##
## storyheigh2.0 STORY 138,600.3
## (301,504.3)
##
## storyheigh2.5 STORY 111,551.2
## (301,833.0)
##
## storyheigh3.0 STORY 4,712.6
## (308,058.6)
##
## storyheighBI-LEVEL 109,410.8
## (304,231.6)
##
## storyheighCAPE COD 160,207.8
## (323,338.4)
##
## storyheighRANCH W/BSMT 150,135.4
## (302,573.2)
##
## storyheighSPLIT LEVEL 133,766.5
## (302,050.8)
##
## heatedarea 110.1***
## (5.3)
##
## houseoldnew 37,214.9***
## (9,101.2)
##
## Income_2020 0.4***
## (0.1)
##
## Treecanopy_2012 -1,418.6***
## (240.1)
##
## White_2020 2,129.0***
## (305.4)
##
## Black_2020 388.0
## (308.2)
##
## Bachelor_2020 281.0
## (244.0)
##
## Park.Buffer 214.3
## (250.0)
##
## schools.Buffer 8,114.2***
## (496.9)
##
## Constant -374,172.5
## (303,240.0)
##
## --------------------------------------------------------
## Observations 44,571
## R2 0.2
## Adjusted R2 0.2
## Residual Std. Error 521,476.8 (df = 44501)
## F Statistic 207.3*** (df = 69; 44501)
## ========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We split 70% of our data into the training set and 30% into the test set.
set.seed(3456)
trainIndex <- createDataPartition(HousingData_nozero$price, p = .7,
list = FALSE,
times = 1)
Train <- HousingData_nozero[ trainIndex,]
Test <- HousingData_nozero[-trainIndex,]
The adjusted R-Squared values indicate that our model accounts for 21% of th variance in home sale prices.
##
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(Train) %>% dplyr::select(price,
## fullbaths, halfbaths, Age, bedrooms, numfirepla, bldggrade,
## heatedfuel, bedroomgroups, yearbuilt, actype, extwall, foundation,
## storyheigh, heatedarea, houseoldnew, Income_2020, Treecanopy_2012,
## White_2020, Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3549771 -86480 -15434 57713 74234971
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.930e+05 3.253e+05 -1.208 0.22692
## fullbaths 3.005e+04 7.470e+03 4.023 5.76e-05 ***
## halfbaths 3.210e+04 8.181e+03 3.924 8.73e-05 ***
## Age 1.605e+03 2.930e+02 5.479 4.31e-08 ***
## bedrooms 5.922e+04 5.314e+03 11.144 < 2e-16 ***
## numfirepla -6.385e+03 7.922e+03 -0.806 0.42024
## bldggradeCUSTOM 1.540e+06 4.981e+04 30.918 < 2e-16 ***
## bldggradeEXCELLENT 7.050e+05 2.839e+04 24.831 < 2e-16 ***
## bldggradeFAIR -2.024e+04 2.740e+04 -0.739 0.46008
## bldggradeGOOD 5.847e+04 9.668e+03 6.048 1.48e-09 ***
## bldggradeMINIMUM -4.111e+04 1.450e+05 -0.283 0.77683
## bldggradeVERY GOOD 2.735e+05 1.577e+04 17.340 < 2e-16 ***
## heatedfuelGAS -2.520e+04 1.022e+04 -2.465 0.01371 *
## heatedfuelNONE -2.244e+04 8.908e+04 -0.252 0.80113
## heatedfuelOIL/WD/COAL 1.065e+05 4.613e+04 2.308 0.02102 *
## heatedfuelSOLAR/GEOTHRM 8.156e+02 2.164e+05 0.004 0.99699
## bedroomgroups -6.625e+04 1.224e+04 -5.412 6.27e-08 ***
## yearbuilt NA NA NA NA
## actypeAC-CHLD WAT -1.020e+05 3.222e+05 -0.317 0.75158
## actypeAC-NONE -4.035e+04 1.958e+04 -2.061 0.03935 *
## actypeAC-WALL UNIT -4.223e+04 1.105e+05 -0.382 0.70236
## actypeAIR-DUCTED 4.267e+03 1.020e+05 0.042 0.96662
## actypeAIR-NO-DUCT -5.988e+04 2.496e+05 -0.240 0.81039
## actypeHEAT PUMP -2.078e+04 9.456e+04 -0.220 0.82604
## extwallASB SHNG/SDG -1.048e+05 5.546e+04 -1.889 0.05885 .
## extwallBOARD&BATTEN 8.177e+04 1.167e+05 0.701 0.48362
## extwallCEDAR,RDWD 8.767e+04 3.531e+04 2.483 0.01304 *
## extwallCEM BR/SPL B -1.533e+04 1.283e+05 -0.119 0.90492
## extwallCOMP OR WLBD 3.825e+03 2.790e+05 0.014 0.98906
## extwallCONC BLOCK -7.168e+04 1.265e+05 -0.567 0.57092
## extwallCORR MTL LGT -1.850e+05 4.161e+05 -0.445 0.65668
## extwallEXT PLYWOOD 1.675e+05 5.500e+04 3.045 0.00233 **
## extwallFACE BRICK -1.980e+04 1.102e+04 -1.797 0.07230 .
## extwallHARDIPLK/DSGN VINYL 7.425e+03 1.061e+04 0.700 0.48385
## extwallIMITATION STONE 2.238e+05 5.579e+05 0.401 0.68826
## extwallLOG 2.113e+04 1.861e+05 0.114 0.90962
## extwallMASONITE -9.719e+03 1.292e+04 -0.752 0.45190
## extwallMODULAR MTL -1.030e+05 3.325e+05 -0.310 0.75670
## extwallPREFIN MTL -8.059e+04 3.225e+05 -0.250 0.80268
## extwallSDG MIN/NONE -2.092e+04 5.582e+05 -0.037 0.97010
## extwallSIDG NO SHTG -9.149e+04 1.621e+05 -0.564 0.57251
## extwallSTONE 3.197e+05 1.552e+05 2.060 0.03940 *
## extwallSTUCCO HRDCT 6.524e+04 3.019e+04 2.161 0.03071 *
## extwallSTUCCO SYNTH 3.258e+04 6.777e+04 0.481 0.63072
## extwallWOOD ON SHTG -8.937e+03 1.756e+04 -0.509 0.61075
## extwallWOOD SHINGLE 8.514e+04 5.306e+04 1.605 0.10859
## foundationCRAWL SPACE 3.276e+04 2.447e+04 1.339 0.18064
## foundationPIER-NO FOUND WALL 3.088e+05 2.643e+05 1.168 0.24267
## foundationSLAB-ABV GRD 1.105e+05 1.017e+05 1.087 0.27700
## foundationSLAB-COM 9.801e+04 3.952e+05 0.248 0.80412
## foundationSLAB-RES 3.678e+04 2.529e+04 1.454 0.14583
## storyheigh1 STORY 2.030e+05 3.227e+05 0.629 0.52935
## storyheigh1.5 STORY 1.774e+05 3.228e+05 0.550 0.58262
## storyheigh2.0 STORY 1.321e+05 3.226e+05 0.410 0.68216
## storyheigh2.5 STORY 1.032e+05 3.231e+05 0.319 0.74947
## storyheigh3.0 STORY -1.794e+04 3.321e+05 -0.054 0.95693
## storyheighBI-LEVEL 1.225e+05 3.271e+05 0.375 0.70802
## storyheighCAPE COD 1.416e+05 3.515e+05 0.403 0.68709
## storyheighRANCH W/BSMT 1.567e+05 3.243e+05 0.483 0.62884
## storyheighSPLIT LEVEL 1.254e+05 3.234e+05 0.388 0.69816
## heatedarea 1.001e+02 6.861e+00 14.595 < 2e-16 ***
## houseoldnew 4.540e+04 1.166e+04 3.893 9.91e-05 ***
## Income_2020 2.835e-01 1.381e-01 2.053 0.04011 *
## Treecanopy_2012 -1.485e+03 3.075e+02 -4.831 1.37e-06 ***
## White_2020 2.138e+03 3.915e+02 5.460 4.79e-08 ***
## Black_2020 4.270e+02 3.957e+02 1.079 0.28049
## Bachelor_2020 5.393e+02 3.116e+02 1.731 0.08353 .
## Park.Buffer 2.029e+02 3.205e+02 0.633 0.52660
## schools.Buffer 8.153e+03 6.365e+02 12.809 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 557600 on 31095 degrees of freedom
## (1096 observations deleted due to missingness)
## Multiple R-squared: 0.2215, Adjusted R-squared: 0.2198
## F-statistic: 132 on 67 and 31095 DF, p-value: < 2.2e-16
Our mean MAE - price absolute error was 139,410 and our MAPE was 34%.
## Error in `stopifnot()`:
## ! Problem while computing `Price.Predict = predict(reg_training, Test)`.
## Caused by error in `model.frame.default()`:
## ! factor actype has new levels AC-PCKD ROOF, HEAT - NONE
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(Train) %>%
dplyr::select(price, fullbaths, halfbaths, Age, bedrooms, numfirepla, bedroomgroups, yearbuilt, heatedarea, houseoldnew, Income_2020, Treecanopy_2012, White_2020, Black_2020, Bachelor_2020, Park.Buffer, schools.Buffer),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 32259 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 31936, 31935, 31937, 31937, 31939, 31935, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 316154.6 0.564399 136358.2
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 303513.5 0.5898570 148043.9 Fold001
## 2 234318.9 0.6623346 144317.7 Fold002
## 3 178858.0 0.6597736 120258.3 Fold003
## 4 170824.6 0.6117729 117214.9 Fold004
## 5 250626.6 0.6431760 129566.3 Fold005
mean(reg.cv$resample[,3])
## [1] 136358.2
##
## Cross Validation Results
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## RMSE 100 316,154.6 481,573.2 147,676.6 4,276,068.0
## Rsquared 100 0.6 0.2 0.01 0.7
## MAE 100 136,358.2 29,890.2 110,452.0 358,648.6
## -------------------------------------------------------
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Spatial Lag
coords <- st_coordinates(Train)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Train$lagPrice <- lag.listw(spatialWeights, Train$price)
coords.test <- st_coordinates(Test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Our Moran’s I is 0.27, with a p-value of 0.001. The Moran’s I output suggests that there is no autocorrelation. However, a p-value of 0.001 suggests that the observed point process is more clustered than all 999 random permutations (1 / 999 = 0.001) and is statistically significant.
Test_lag <-
Test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error, NAOK = TRUE))
## Error in `stopifnot()`:
## ! Problem while computing `lagPriceError =
## lag.listw(spatialWeights.test, Price.Error, NAOK = TRUE)`.
## Caused by error in `lag.listw()`:
## ! object 'Price.Error' not found
Test_filter <- Test_lag %>% filter(Price.Error >0,lagPriceError >0)
## Error in filter(., Price.Error > 0, lagPriceError > 0): object 'Test_lag' not found
## Error in ggplot(data = Test_filter, aes(lagPriceError, Price.Error)): object 'Test_filter' not found
MoranTest <-moran.mc(na.omit(Test_lag$Price.Error),
spatialWeights.test, nsim = 999, zero.policy = TRUE)
## Error in na.omit(Test_lag$Price.Error): object 'Test_lag' not found
ggplot(as.data.frame(MoranTest$res[c(1:999)]), aes(MoranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = MoranTest$statistic), colour = "pink",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in pink",
x="Moran's I",
y="Count") +
plotTheme()
## Error in as.data.frame(MoranTest$res[c(1:999)]): object 'MoranTest' not found
We included a spatial variable to account for the spatial process at the neighborhood or local scale. We used census tracts to organize the grouping of homes into “neighborhoods”, to understand the neighborhood effects on home prices.
Test_lag %>%
st_drop_geometry() %>%
group_by(NAME)%>% summarize(meanPrice = mean(price, na.rm = T),
meanPrediction = mean(Price.Predict)) %>%
kable(caption = "Price and Prediction by block group") %>% kable_material_dark("basic", full_width = F) %>% scroll_box(width = "800px", height = "500px")
## Error in st_drop_geometry(.): object 'Test_lag' not found
ggplot() +
geom_sf(data = Mecklenburg, fill = "grey89", color = "grey89") +
geom_sf(data = Test_lag, aes(colour = q5(Price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette11,
labels=qBr(Test_lag,"Price.Predict"),
name="Quintile\nBreaks") +
labs(title="Predicted Sale Price, Mecklenburg County, NC") +
mapTheme()
## Error in fortify(data): object 'Test_lag' not found
Test_lag <-
Test_lag %>%
group_by(NAME) %>%
mutate(MeanPrice = mean(price))
## Error in group_by(., NAME): object 'Test_lag' not found
Test_lag <-
Test_lag %>%
group_by(NAME) %>%
mutate(MAPE = mean(Price.APE))
## Error in group_by(., NAME): object 'Test_lag' not found
ggplot(Test_lag) +
geom_point(aes(MeanPrice, MAPE)) +
geom_smooth(method = "lm", aes(MeanPrice, MAPE), se = FALSE, colour = "pink") +
labs(title = "MAPE by Census Tracts as a Function of Mean Price by Tract",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
## Error in ggplot(Test_lag): object 'Test_lag' not found
# Mean
st_drop_geometry(Test_lag) %>%
group_by(NAME) %>%
summarize(MAPE = mean(Price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(Mck_tracts) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = MAPE)) +
scale_colour_manual(values = palette11,
name = "MAPE") +
geom_sf(data = Test_lag, colour = "grey", show.legend = "point", size = .5) +
labs(title = "Mean test set MAPE by Block Groups") +
mapTheme()
## Error in st_drop_geometry(Test_lag): object 'Test_lag' not found
We studied the generalizability across urban contexts using race and income as metrics. One can see that Mecklenburg is segregated by income and race.
ggplot() + geom_sf(data = na.omit(Mck_tracts), aes(fill = raceContext)) +
scale_fill_manual(values = palette11, name="Race Context") +
labs(title = "Race in Mecklenburg County,NC") +
mapTheme() + theme(legend.position="bottom")
ggplot() + geom_sf(data = na.omit(Mck_tracts), aes(fill = incomeContext)) +
scale_fill_manual(values = c(palette11), name="Income Context") +
labs(title = "Median Household Income in Mecklenburg County, NC") +
mapTheme() + theme(legend.position="bottom")
#palette11<- c("#f3e79b","#fac484","#f8a07e","#eb7f86","#ce6693","#a059a0","#5c53a5")
Challenge_Meclenburg <-
Challenge_Meclenburg %>%
mutate(Price.Predict = predict(reg_training, Challenge_Meclenburg),
Price.Error = Price.Predict - price,
Price.AbsError = abs(Price.Predict - price),
Price.APE = (abs(Price.Predict - price)) / Price.Predict)
#PredictionsExport <- subset(Predictions, select = c("MUSA_ID", "Price.Predict")) %>% st_drop_geometry()
In general, our model produced relatively well. 21% of our model can explain the variance of sale prices in Mecklenburg County, with an average sale price error of roughly 130K. We feel that our model is decent at indicating the prediction of sale prices, but maybe not as generalizable in predicting houses in a different regional context. Additionally, It doesn’t capture the nuances within the local neighborhood context.
Variables that provided interesting insight and proceed to be rather significant were the physical characteristics of the house. More obviously, variables like full bathrooms, the number of bedrooms, the age of the house, and the quality of the building grade proved to be significant indicators. Some of the spatial amenities that we added also proved to be a significant indicator in predicting sale prices, including the percentage of white and black residents in an area, and median household income. However, the white population was far more significant than the black population in predicting the sale price of a home. To our surprise, our added variable of residential tree canopy and school buffers were highly significant indicators.
We would not recommend our model to Zillow as it currently stands. Our model would benefit from further transformation and categorization of the home characteristics within the housing data set. Attributes related to the square footage of the home and lot could have been interesting to add to the model. Additionally, in a car-dependent county such as Mecklenburg, we felt that information on garages and parking would have an impact on home prices and individual decisions and preferences. Other external variables such as crime, grocery stores, and flood data would have been good variables to add to the model. Furthermore, our model’s generalizability across space was not as strong as it could be. We are also missing some spatial nuances that impact home prices that could have been achieved through more thorough spatial processes.
The biggest challenge we faced in this project was in debugging our code and dealing with unforeseen errors with the R studio interface. We ran into several situations where our code would initially run, and suddenly stop after. This disallowed us to explore certain processes further due to time constraints. Overall, through this exercise, we were introduced to new tools such as geospatial machine learning and challenged the boundaries of our understanding as well as our capabilities in R.